1. Data preparation¶
# sentiment
categories = ['nostalgia', 'not nostalgia']
# download data from web
import pandas as pd
df = pd.read_csv("hf://datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data.csv")
# observe data
print(df)
X = df.rename(columns={'sentiment': 'sentiment_name'})
         sentiment                                            comment
0    not nostalgia  He was a singer with a golden voice that I lov...
1        nostalgia  The mist beautiful voice ever I listened to hi...
2        nostalgia  I have most of Mr. Reeves songs.  Always love ...
3    not nostalgia  30 day leave from 1st tour in Viet Nam to conv...
4        nostalgia  listening to his songs reminds me of my mum wh...
...            ...                                                ...
1495 not nostalgia  i don't know!..but the opening of the video,.....
1496 not nostalgia  it's sad this is such a beautiful song when yo...
1497 not nostalgia  Dear Friend, I think age and time is not that ...
1498     nostalgia  I was born in 1954 and started to be aware of ...
1499     nostalgia  This is the first CD I bought after my marriag...
[1500 rows x 2 columns]
# my functions
import helpers_homework.data_mining_helpers as dmh
# convert the category labels from text to numbers
X['sentiment'] = X['sentiment_name'].apply(lambda t: dmh.format_labels_number(t, X))
# reorder the columns
X = X[['sentiment','comment', 'sentiment_name']]
X[0:10]
|  | sentiment | comment | sentiment_name |
|---|---|---|---|
| 0 | 1 | He was a singer with a golden voice that I lov... | not nostalgia |
| 1 | 0 | The mist beautiful voice ever I listened to hi... | nostalgia |
| 2 | 0 | I have most of Mr. Reeves songs. Always love ... | nostalgia |
| 3 | 1 | 30 day leave from 1st tour in Viet Nam to conv... | not nostalgia |
| 4 | 0 | listening to his songs reminds me of my mum wh... | nostalgia |
| 5 | 0 | Every time I heard this song as a child, I use... | nostalgia |
| 6 | 0 | My dad loved listening to Jim Reeves, when I w... | nostalgia |
| 7 | 0 | i HAVE ALSO LISTENED TO Jim Reeves since child... | nostalgia |
| 8 | 1 | Wherever you are you always in my heart | not nostalgia |
| 9 | 1 | Elvis will always be number one no one can com... | not nostalgia |
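`dmh.format_labels_number` is a course helper whose source isn't shown here; a minimal stand-in (a hypothetical sketch, not the actual helper) could map each label to its index in the sorted label set, which reproduces the coding above (nostalgia → 0, not nostalgia → 1):

```python
import pandas as pd

def format_labels_number(label, df):
    # Map each distinct sentiment label to a stable integer code
    # based on sorted order: 'nostalgia' -> 0, 'not nostalgia' -> 1.
    labels = sorted(df['sentiment_name'].unique())
    return labels.index(label)

demo = pd.DataFrame({'sentiment_name': ['not nostalgia', 'nostalgia']})
codes = demo['sentiment_name'].apply(lambda t: format_labels_number(t, demo))
```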
2. Data Mining¶
2.1 Missing Data processing¶
# check the data for missing values
X.isnull().apply(lambda x: dmh.check_missing_values(x))
|  | sentiment | comment | sentiment_name |
|---|---|---|---|
| 0 | The amount of missing records is: | The amount of missing records is: | The amount of missing records is: |
| 1 | 0 | 0 | 0 |
There are no missing values in the data, so no further handling is needed.
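The same check can be done with plain pandas, without the course helper — `isnull().sum()` gives the per-column missing count directly:

```python
import pandas as pd

# toy frame with one missing comment
df_demo = pd.DataFrame({'sentiment': [1, 0], 'comment': ['a', None]})
missing_per_column = df_demo.isnull().sum()  # count of NaNs in each column
```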
2.2 Dealing with Duplicate Data¶
# check for duplicate rows
sum(X.duplicated())
1
# locate the duplicate rows
X[X.duplicated(keep=False)]
|  | sentiment | comment | sentiment_name |
|---|---|---|---|
| 62 | 1 | never heard this song before... WOW What an am... | not nostalgia |
| 78 | 1 | never heard this song before... WOW What an am... | not nostalgia |
X.drop_duplicates(keep='first', inplace=True)  # drop duplicate rows, keeping the first occurrence
X.reset_index(drop=True, inplace=True)  # rebuild the index
# print again to confirm
X
|  | sentiment | comment | sentiment_name |
|---|---|---|---|
| 0 | 1 | He was a singer with a golden voice that I lov... | not nostalgia |
| 1 | 0 | The mist beautiful voice ever I listened to hi... | nostalgia |
| 2 | 0 | I have most of Mr. Reeves songs. Always love ... | nostalgia |
| 3 | 1 | 30 day leave from 1st tour in Viet Nam to conv... | not nostalgia |
| 4 | 0 | listening to his songs reminds me of my mum wh... | nostalgia |
| ... | ... | ... | ... |
| 1494 | 1 | i don't know!..but the opening of the video,..... | not nostalgia |
| 1495 | 1 | it's sad this is such a beautiful song when yo... | not nostalgia |
| 1496 | 1 | Dear Friend, I think age and time is not that ... | not nostalgia |
| 1497 | 0 | I was born in 1954 and started to be aware of ... | nostalgia |
| 1498 | 0 | This is the first CD I bought after my marriag... | nostalgia |
1499 rows × 3 columns
3. Data processing¶
3.1 Sampling¶
X_sample = X.sample(n=750)  # note: no random_state is set, so this draw is not reproducible
import matplotlib.pyplot as plt
%matplotlib inline
# count the categories in both datasets
X_counts = X.sentiment_name.value_counts()
X_sample_counts = X_sample.sentiment_name.value_counts()
# collect all categories and align the sample counts to the same category order
all_categories = X_counts.index
X_sample_counts = X_sample_counts.reindex(all_categories, fill_value=0)
# set bar width and positions
bar_width = 0.2
index = range(len(all_categories))
# plot bars for the full dataset
plt.bar(index, X_counts, bar_width, label='Dataset X')
# plot bars for the sample, shifted right
plt.bar([i + bar_width for i in index], X_sample_counts, bar_width, label='Dataset X_sample')
# set the title and labels
plt.title('Category distribution')
plt.xticks([i + bar_width / 2 for i in index], all_categories, rotation=0)
# add the legend
plt.legend()
# show the chart
plt.show()
The two categories are distributed roughly 1:1; overall, the "not nostalgia" class has one fewer record because its duplicate row was removed.
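Since no `random_state` is passed, the 750-row sample above changes on every run. A sketch of a reproducible, class-balanced alternative using `GroupBy.sample` (the seed 42 is an arbitrary choice):

```python
import pandas as pd

# toy frame with two balanced classes
df_demo = pd.DataFrame({
    'sentiment_name': ['nostalgia'] * 4 + ['not nostalgia'] * 4,
    'comment': list('abcdefgh'),
})
# sample half of each class with a fixed seed, so the draw is reproducible
# and the class ratio is preserved exactly
X_sample_demo = df_demo.groupby('sentiment_name').sample(frac=0.5, random_state=42)
```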
3.2 Feature Creation¶
import nltk
# takes a minute or two to process
X['unigrams'] = X['comment'].apply(lambda x: dmh.tokenize_text(x))
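`dmh.tokenize_text` is also a course helper; a simplified stand-in (hypothetical — the real helper may use NLTK's tokenizer) that lowercases and splits on word characters:

```python
import re

def tokenize_text(text):
    # lowercase, then pull out runs of letters, digits, or apostrophes;
    # a simplified stand-in for the course helper dmh.tokenize_text
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize_text("He was a singer with a golden voice")
```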
3.3 CountVectorizer¶
3.3.1 Feature subset selection¶
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.comment)
feature_terms = count_vect.get_feature_names_out()
tdm_df = pd.DataFrame(X_counts.toarray(), columns=feature_terms, index=X.index)
X_counts.shape  # 1499 documents, 3730 features
(1499, 3730)
# inspect a slice of the current feature matrix
plot_x = ["term_" + str(i) for i in feature_terms[0:20]]
plot_y = ["doc_" + str(i) for i in list(X.index)[0:20]]
plot_z = X_counts[0:20, 0:20].toarray()  # slice: [first 20 documents, first 20 terms]
# visualize with a heatmap
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns=plot_x, index=plot_y)
plt.subplots(figsize=(5, 5))
ax = sns.heatmap(df_todraw,
                 cmap="PuRd",  # pink-toned colormap
                 vmin=0, vmax=1, annot=True)  # annot: show the value in each cell
plt.show()
The first 20 features rarely appear in the first 20 documents.
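The heatmap is almost entirely zeros, which is typical of term-document matrices. The sparsity (fraction of zero entries) can be quantified directly; a toy sketch:

```python
import numpy as np

# toy term-document matrix: 2 documents x 3 terms
tdm = np.array([[0, 1, 0],
                [2, 0, 0]])
# sparsity = fraction of entries that are zero
sparsity = 1.0 - np.count_nonzero(tdm) / tdm.size
```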
3.3.2 Attribute Transformation¶
# compute the total frequency of every feature
import numpy as np
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]  # sum down the columns (over documents)
term_frequencies  # total count of each feature
feature_terms  # all feature names
array(['00', '000', '045', ..., 'yup', 'zealand', 'zulus'], dtype=object)
Observations on the raw counts
# distribution of the first 300 features (raw counts)
plt.close()  # close the previous figure
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=feature_terms[:300], y=term_frequencies[:300])
g.set_xticks(range(300))  # set the x-axis tick positions
g.set_xticklabels(feature_terms[:300], rotation=90);
plt.show()
# interactive chart of the raw counts (top 300 terms by frequency)
import plotly.express as px
plt.close()
data = pd.DataFrame({'Terms': feature_terms, 'Frequencies': term_frequencies})
top_data = data.nlargest(300, 'Frequencies')  # the 300 most frequent terms, sorted descending
fig = px.bar(top_data, x='Terms', y='Frequencies', title='Top 300 Most Frequent Terms', text='Frequencies')
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(xaxis_tickangle=-90)
fig.show()
Observations on the log-transformed counts
# calculate log data frequency
import math
term_frequencies_log = [math.log(i) for i in term_frequencies]
# distribution of the first 300 features (log counts)
plt.close()  # close the previous figure
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=feature_terms[:300], y=term_frequencies_log[:300])
g.set_xticks(range(300))  # set the x-axis tick positions
g.set_xticklabels(feature_terms[:300], rotation=90);
plt.show()
# interactive chart of the log counts (top 300 by value)
import plotly.express as px
plt.close()
data = pd.DataFrame({'Terms': feature_terms, 'Frequencies': term_frequencies_log})
top_data = data.nlargest(300, 'Frequencies')  # the 300 largest log frequencies, sorted descending
fig = px.bar(top_data, x='Terms', y='Frequencies', title='Top 300 Most Frequent Terms', text='Frequencies')
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(xaxis_tickangle=-90)
fig.show()
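`math.log` would fail on a zero count; here every feature occurs at least once, so the list comprehension above is safe, but `np.log1p` is a vectorized alternative that is also defined at zero:

```python
import numpy as np

counts = np.array([0, 1, 9, 99])
# log(1 + x): monotonic like log, but defined at x = 0
log_counts = np.log1p(counts)
```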
3.3.3 Attribute Aggregation¶
Finding the raw features of each category¶
category_numbers = [0, 1]
category_dfs = {}
for category in categories:
    category_dfs[category] = X[X['sentiment_name'] == category].copy()
# define a function that builds a term-document matrix with CountVectorizer
def create_term_document_df_CountVector(df, min_df=0.0, max_df=1.0):
    count_vect_temp = CountVectorizer(min_df=min_df, max_df=max_df)  # initialize the CountVectorizer
    X_counts_temp = count_vect_temp.fit_transform(df['comment'])  # transform the text data into word counts
    words_temp = count_vect_temp.get_feature_names_out()
    term_document_df_temp = pd.DataFrame(X_counts_temp.toarray(), columns=words_temp)
    return term_document_df_temp
# build a separate feature set for each category
filt_term_document_dfs = {}
for category in categories:
    filt_term_document_dfs[category] = create_term_document_df_CountVector(category_dfs[category])
# display the term-document matrices
for category in categories:
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(filt_term_document_dfs[category])
Filtered Term-Document Frequency DataFrame for Category nostalgia:
07 10 11 11th 12 13 14 15 16 17 ... young younger youngster \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
1 0 0 0 0 0 0 0 0 0 1 ... 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
.. .. .. .. ... .. .. .. .. .. .. ... ... ... ...
745 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
746 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
747 0 0 0 0 0 1 0 0 0 0 ... 0 0 0
748 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
749 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
your yours youth youthful youtube yrs yup
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0
.. ... ... ... ... ... ... ...
745 0 0 0 0 0 0 1
746 0 0 0 0 0 0 0
747 0 0 0 0 0 0 0
748 0 0 0 0 0 0 0
749 0 0 0 0 0 0 0
[750 rows x 2295 columns]
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
00 000 045 10 100 10m 11 12 14 15 ... youngest youngsters \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0
.. .. ... ... .. ... ... .. .. .. .. ... ... ...
744 0 0 0 0 0 0 0 0 0 0 ... 0 0
745 0 0 0 0 0 0 0 0 0 0 ... 0 0
746 0 0 0 0 0 0 0 0 0 0 ... 0 0
747 0 0 0 0 0 0 0 0 0 0 ... 0 0
748 0 0 0 0 0 0 0 0 0 0 ... 0 0
your yourself youth youtube yrs yuo zealand zulus
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
.. ... ... ... ... ... ... ... ...
744 0 0 0 0 0 0 0 0
745 0 0 0 0 0 0 0 0
746 0 0 0 0 0 0 0 0
747 1 0 0 0 0 0 0 0
748 0 0 0 0 0 0 0 0
[749 rows x 2602 columns]
Initially, nostalgia has 2295 features and not nostalgia has 2602.
for category in categories:
    word_counts = filt_term_document_dfs[category].sum(axis=0).to_numpy()
    plt.close()
    plt.figure(figsize=(10, 6))
    plt.hist(word_counts, bins=100, color='blue', edgecolor='black')
    plt.title(f'Term Frequency Distribution for Category {category}')
    plt.xlabel('Frequency')
    plt.ylabel('Number of Terms')
    plt.xlim(1, 200)
    plt.show()
In both categories, most features occur only a few times; we want to drop the features that are extremely rare as well as those that are extremely frequent.
Removing the lowest- and highest-frequency features¶
# build a filter that drops the rarest and the most frequent words
def filter_top_bottom_words_by_sum(term_document_df, top_percent=0.05, bottom_percent=0.01):
    word_sums = term_document_df.sum(axis=0)
    sorted_words = word_sums.sort_values()
    total_words = len(sorted_words)
    top_n = int(top_percent * total_words)  # number of high-frequency words to drop
    bottom_n = int(bottom_percent * total_words)  # number of low-frequency words to drop
    words_to_remove = pd.concat([sorted_words.head(bottom_n), sorted_words.tail(top_n)]).index
    # print(f'Bottom {bottom_percent*100}% words: \n{sorted_words.head(bottom_n)}')  # which words fall in the bottom percentage
    # print(f'Top {top_percent*100}% words: \n{sorted_words.tail(top_n)}')  # which words fall in the top percentage
    return term_document_df.drop(columns=words_to_remove)
# drop the top and bottom of the frequency distribution
term_document_dfs = {}
for category in categories:
    print(f'\nFor category {category} we filter the following words:')
    term_document_dfs[category] = filter_top_bottom_words_by_sum(filt_term_document_dfs[category])
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(term_document_dfs[category])
For category nostalgia we filter the following words:
Filtered Term-Document Frequency DataFrame for Category nostalgia:
07 10 11 11th 12 13 14 15 16 17 ... yo yokel younger \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
1 0 0 0 0 0 0 0 0 0 1 ... 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
.. .. .. .. ... .. .. .. .. .. .. ... .. ... ...
745 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
746 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
747 0 0 0 0 0 1 0 0 0 0 ... 0 0 0
748 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
749 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
youngster your yours youth youthful youtube yrs
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0
4 0 0 0 0 0 0 0
.. ... ... ... ... ... ... ...
745 0 0 0 0 0 0 0
746 0 0 0 0 0 0 0
747 0 0 0 0 0 0 0
748 0 0 0 0 0 0 0
749 0 0 0 0 0 0 0
[750 rows x 2159 columns]
For category not nostalgia we filter the following words:
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
000 045 10 100 10m 11 12 14 15 150 ... younger youngest \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0
.. ... ... .. ... ... .. .. .. .. ... ... ... ...
744 0 0 0 0 0 0 0 0 0 0 ... 0 0
745 0 0 0 0 0 0 0 0 0 0 ... 0 0
746 0 0 0 0 0 0 0 0 0 0 ... 0 0
747 0 0 0 0 0 0 0 0 0 0 ... 0 0
748 0 0 0 0 0 0 0 0 0 0 ... 0 0
youngsters yourself youth youtube yrs yuo zealand zulus
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
.. ... ... ... ... ... ... ... ...
744 0 0 0 0 0 0 0 0
745 0 0 0 0 0 0 0 0
746 0 0 0 0 0 0 0 0
747 0 0 0 0 0 0 0 0
748 0 0 0 0 0 0 0 0
[749 rows x 2446 columns]
After filtering, the feature counts become: not nostalgia 2602 -> 2446; nostalgia 2295 -> 2159.
# save the selected features to CSV for easy reuse
from PAMI.extras.convert.DF2DB import DF2DB
for category in term_document_dfs:
    category_safe = category.replace(' ', '_')
    obj = DF2DB(term_document_dfs[category])
    obj.convert2TransactionalDatabase(f'./td_freq_db/td_freq_db_{category_safe}.csv', '>=', 1)
# observe the transactional datasets
from PAMI.extras.dbStats import TransactionalDatabase as tds
def observation_transactional_Database(name):
    plt.close()
    name = name.replace(' ', '_')
    obj = tds.TransactionalDatabase(f'./td_freq_db/td_freq_db_{name}.csv')
    print(f'Transactional Dataset {name}:')
    obj.run()
    obj.printStats()
    obj.plotGraphs()
    plt.show()
for category in categories:
    observation_transactional_Database(category)
Transactional Dataset nostalgia:
Database size (total no of transactions) : 734
Number of items : 2159
Minimum Transaction Size : 1
Average Transaction Size : 8.693460490463215
Maximum Transaction Size : 39
Standard Deviation Transaction Size : 7.213372063492091
Variance in Transaction Sizes : 52.10372252435774
Sparsity : 0.9959733855996001
Transactional Dataset not_nostalgia:
Database size (total no of transactions) : 745
Number of items : 2446
Minimum Transaction Size : 1
Average Transaction Size : 8.410738255033557
Maximum Transaction Size : 46
Standard Deviation Transaction Size : 5.926429722323316
Variance in Transaction Sizes : 35.16977700801039
Sparsity : 0.9965614316210002
Building augmented_df with FPGrowth, FAE topK, and MaxFPGrowth¶
# use FPGrowth with minsup
from PAMI.frequentPattern.basic import FPGrowth as alg
def FPGrowth_minsup(minSup, name):
    obj = alg.FPGrowth(iFile=f'./td_freq_db/td_freq_db_{name}.csv', minSup=minSup)
    obj.mine()
    frequentPatternsDF_temp = obj.getPatternsAsDataFrame()
    print(name)
    print('Total No of patterns: ' + str(len(frequentPatternsDF_temp)))
    print('Runtime: ' + str(obj.getRuntime()))
    obj.save(f'./freq_patterns_minsup/freq_patterns_{name}_minSup{minSup}.txt')  # save the patterns
    return frequentPatternsDF_temp
# use FAE topK
from PAMI.frequentPattern.topk import FAE
def FAE_topK(k, name):
    obj = FAE.FAE(iFile=f'./td_freq_db/td_freq_db_{name}.csv', k=k)
    obj.mine()
    frequentPatternsDF_temp = obj.getPatternsAsDataFrame()
    print(name)
    print('Total No of patterns: ' + str(len(frequentPatternsDF_temp)))
    print('Runtime: ' + str(obj.getRuntime()))
    obj.save(f'./freq_patterns_topK/freq_patterns_{name}_topK{k}.txt')  # save the patterns
    return frequentPatternsDF_temp
# use MaxFPGrowth with minsup
from PAMI.frequentPattern.maximal import MaxFPGrowth as algm
def FPGrowth_max(minSup, name):
    obj = algm.MaxFPGrowth(iFile=f'./td_freq_db/td_freq_db_{name}.csv', minSup=minSup)
    obj.mine()
    frequentPatternsDF_temp = obj.getPatternsAsDataFrame()
    print(name)
    print('Total No of patterns: ' + str(len(frequentPatternsDF_temp)))
    print('Runtime: ' + str(obj.getRuntime()))
    obj.save(f'./freq_patterns_max/freq_patterns_{name}_max_minSup{minSup}.txt')  # save the patterns
    return frequentPatternsDF_temp
def pattern_integrate(frequentPatternsDF):
    dfs = []
    for category in categories:
        dfs.append(frequentPatternsDF[category])
    combined_df = pd.concat(dfs, ignore_index=True)
    pattern_counts = combined_df['Patterns'].value_counts()
    unique_patterns = pattern_counts[pattern_counts == 1].index
    final_pattern_df = combined_df[combined_df['Patterns'].isin(unique_patterns)].sort_values(by='Support', ascending=False)
    # print(final_pattern_df)
    # print(f"Number of patterns discarded: {(len(pattern_counts) - len(unique_patterns))*2}")  # count of discarded patterns
    return final_pattern_df
def augmented_df_generation(final_pattern_df):
    X['tokenized_comment'] = X['comment'].str.split().apply(set)
    pattern_matrix = pd.DataFrame(0, index=X.index, columns=final_pattern_df['Patterns'])
    for pattern in final_pattern_df['Patterns']:
        pattern_words = set(pattern.split())  # tokenize the pattern into words
        pattern_matrix[pattern] = X['tokenized_comment'].apply(lambda x: 1 if pattern_words.issubset(x) else 0)
    augmented_df = pd.concat([tdm_df, pattern_matrix], axis=1)  # combine with the features found above
    return augmented_df
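`pattern_integrate` keeps only patterns mined in exactly one of the two categories, using `value_counts` to find patterns with count 1. On toy data:

```python
import pandas as pd

# 'favorite' is mined in both categories, 'elvis' in only one
combined = pd.DataFrame({
    'Patterns': ['favorite', 'elvis', 'favorite'],
    'Support':  [30, 21, 5],
})
counts = combined['Patterns'].value_counts()
unique_patterns = counts[counts == 1].index            # patterns seen exactly once
final = combined[combined['Patterns'].isin(unique_patterns)]  # shared patterns dropped
```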
# use minsup = 3 to find features for model training
frequentPatternsDF_minsup = {}
for category in categories:
    category_save = category.replace(' ', '_')
    frequentPatternsDF_minsup[category] = FPGrowth_minsup(3, category_save)
    print(frequentPatternsDF_minsup[category])
final_pattern_df_minsup = pattern_integrate(frequentPatternsDF_minsup)
augmented_df_minsup = augmented_df_generation(final_pattern_df_minsup)
augmented_df_minsup
Frequent patterns were generated successfully using frequentPatternGrowth algorithm
nostalgia
Total No of patterns: 948
Runtime: 0.030440568923950195
Patterns Support
0 forgot 3
1 mr 3
2 appreciate 3
3 death 3
4 death jim 3
.. ... ...
943 would 28
944 will 28
945 will favorite 3
946 go 28
947 favorite 30
[948 rows x 2 columns]
Frequent patterns were generated successfully using frequentPatternGrowth algorithm
not_nostalgia
Total No of patterns: 730
Runtime: 0.014939546585083008
Patterns Support
0 emotional 3
1 fan 3
2 30 3
3 blessing 3
4 december 3
.. ... ...
725 classic 21
726 them 21
727 them every 4
728 lyrics 21
729 lyrics every 3
[730 rows x 2 columns]
|  | 00 | 000 | 045 | 07 | 10 | 100 | 10m | 11 | 11th | 12 | ... | later ever | later year | later been | make cry | make where | make them | hearing away | missed today | country favorite | lyrics every |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1494 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1495 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1496 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1497 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1498 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1499 rows × 4784 columns
# use FAE_topK with k = 800 to find features for model training
frequentPatternsDF_topK = {}
for category in categories:
    category_save = category.replace(' ', '_')
    frequentPatternsDF_topK[category] = FAE_topK(800, category_save)
    print(frequentPatternsDF_topK[category])
final_pattern_df_topK = pattern_integrate(frequentPatternsDF_topK)
augmented_df_topK = augmented_df_generation(final_pattern_df_topK)
augmented_df_topK
TopK frequent patterns were successfully generated using FAE algorithm.
nostalgia
Total No of patterns: 800
Runtime: 0.39636778831481934
Patterns Support
0 favorite 30
1 ever 28
2 would 28
3 will 28
4 go 28
.. ... ...
795 over get 3
796 over country 3
797 over which 3
798 over pop 3
799 over perfect 3
[800 rows x 2 columns]
TopK frequent patterns were successfully generated using FAE algorithm.
not_nostalgia
Total No of patterns: 800
Runtime: 0.282914400100708
Patterns Support
0 elvis 21
1 every 21
2 loved 21
3 classic 21
4 them 21
.. ... ...
795 difference 2
796 nine 2
797 slap 2
798 naughty 2
799 needs 2
[800 rows x 2 columns]
|  | 00 | 000 | 045 | 07 | 10 | 100 | 10m | 11 | 11th | 12 | ... | fall | describes | compose | memorable | genre | amazingly | sweetest | arms | cruel | needs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1494 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1495 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1496 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1497 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1498 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1499 rows × 4700 columns
# use MaxFPGrowth with minsup = 3 to find features for model training
frequentPatternsDF_max = {}
for category in categories:
    category_save = category.replace(' ', '_')
    frequentPatternsDF_max[category] = FPGrowth_max(3, category_save)
    print(frequentPatternsDF_max[category])
final_pattern_df_max = pattern_integrate(frequentPatternsDF_max)
augmented_df_max = augmented_df_generation(final_pattern_df_max)
augmented_df_max
Maximal Frequent patterns were generated successfully using MaxFp-Growth algorithm
nostalgia
Total No of patterns: 682
Runtime: 0.03948616981506348
Patterns Support
0 skating 3
1 walker 3
2 scott 3
3 17 1987 3
4 stop 3
.. ... ...
677 will such 4
678 ever only 3
679 would only 4
680 ever kid 3
681 favorite will 3
[682 rows x 2 columns]
Maximal Frequent patterns were generated successfully using MaxFp-Growth algorithm
not_nostalgia
Total No of patterns: 592
Runtime: 0.0324559211730957
Patterns Support
0 thinks 3
1 months 3
2 currently 3
3 kids 3
4 wait 3
.. ... ...
587 days 20
588 every lyrics 3
589 every them 4
590 classic 21
591 loved 21
[592 rows x 2 columns]
|  | 00 | 000 | 045 | 07 | 10 | 100 | 10m | 11 | 11th | 12 | ... | wish could see | ever boy | us too | been too | about too | listened singer | no singer | well singer | since singer | since got |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1494 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1495 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1496 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1497 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1498 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1499 rows × 4710 columns
3.3.4 Dimensionality Reduction¶
2D by PCA, t-SNE, UMAP¶
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
col = ['coral', 'blue']
def Dimensionality_2D(now_df):
    X_pca = PCA(n_components=2).fit_transform(now_df.values)
    X_tsne = TSNE(n_components=2).fit_transform(now_df.values)
    X_umap = umap.UMAP(n_components=2).fit_transform(now_df.values)
    return X_pca, X_tsne, X_umap
def plot_scatter(ax, X_reduced, title):
    for c, category in zip(col, categories):
        xs = X_reduced[X['sentiment_name'] == category].T[0]
        ys = X_reduced[X['sentiment_name'] == category].T[1]
        ax.scatter(xs, ys, c=c, marker='o', label=category)
    ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
    ax.set_title(title)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.legend(loc='upper right')
def draw_2D_plt(X_pca, X_tsne, X_umap):
    plt.close()
    fig, axes = plt.subplots(1, 3, figsize=(30, 10))
    fig.suptitle('PCA, t-SNE, and UMAP Comparison')
    plot_scatter(axes[0], X_pca, 'PCA')
    plot_scatter(axes[1], X_tsne, 't-SNE')
    plot_scatter(axes[2], X_umap, 'UMAP')
    plt.show()
X_pca, X_tsne, X_umap = Dimensionality_2D(tdm_df)
draw_2D_plt(X_pca, X_tsne, X_umap)
X_pca, X_tsne, X_umap = Dimensionality_2D(augmented_df_minsup) # minsup FPGrowth
X_pca.shape
(1499, 2)
draw_2D_plt(X_pca, X_tsne, X_umap)
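PCA projects onto the directions of largest variance, so it helps to check how much variance the two retained components actually explain. A numpy sketch via SVD of centered data (equivalent to scikit-learn's `PCA.explained_variance_ratio_`):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
data[:, 0] *= 10                      # give one direction dominant variance
centered = data - data.mean(axis=0)   # PCA operates on centered data
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained_ratio = (s ** 2) / (s ** 2).sum()  # variance share of each component
```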
3D by PCA, t-SNE, UMAP¶
from mpl_toolkits.mplot3d import Axes3D
def Dimensionality_3D(now_df):
    X_pca = PCA(n_components=3).fit_transform(now_df.values)
    X_tsne = TSNE(n_components=3).fit_transform(now_df.values)
    X_umap = umap.UMAP(n_components=3).fit_transform(now_df.values)
    return X_pca, X_tsne, X_umap
X_pca_minsup, X_tsne_minsup, X_umap_minsup = Dimensionality_3D(augmented_df_minsup)
X_pca_topK, X_tsne_topK, X_umap_topK = Dimensionality_3D(augmented_df_topK)
X_pca_max, X_tsne_max, X_umap_max = Dimensionality_3D(augmented_df_max)
angle_3D = [[0, 15, 90], [0, 60, 120]]
# define a function to create a 3D scatter plot
def plot_scatter_3d(ax, X_reduced, title):
    for c, category in zip(col, categories):
        xs = X_reduced[X['sentiment_name'] == category][:, 0]
        ys = X_reduced[X['sentiment_name'] == category][:, 1]
        zs = X_reduced[X['sentiment_name'] == category][:, 2]
        ax.scatter(xs, ys, zs, c=c, marker='o', label=category)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.legend(loc='upper right')
    ax.set_title(title)
# build the 3D figure (PCA)
plt.close()
fig = plt.figure(figsize=(30, 30))  # one large figure to hold nine subplots
fig.suptitle('PCA 3D from Three Angles and Three Augmented DataFrames')
# first row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 1 + i, projection='3d')  # 3x3 grid
    plot_scatter_3d(ax, X_pca_minsup, f'PCA_minsup ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle
# second row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 4 + i, projection='3d')
    plot_scatter_3d(ax, X_pca_topK, f'PCA_topK ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])
# third row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 7 + i, projection='3d')
    plot_scatter_3d(ax, X_pca_max, f'PCA_max ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])
plt.show()
# build the 3D figure (t-SNE)
plt.close()
fig = plt.figure(figsize=(30, 30))  # one large figure to hold nine subplots
fig.suptitle('t-SNE 3D from Three Angles and Three Augmented DataFrames')
# first row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 1 + i, projection='3d')  # 3x3 grid
    plot_scatter_3d(ax, X_tsne_minsup, f't-SNE_minsup ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle
# second row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 4 + i, projection='3d')
    plot_scatter_3d(ax, X_tsne_topK, f't-SNE_topK ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])
# third row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 7 + i, projection='3d')
    plot_scatter_3d(ax, X_tsne_max, f't-SNE_max ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])
plt.show()
# build the 3D figure (UMAP)
plt.close()
fig = plt.figure(figsize=(30, 30))  # one large figure to hold nine subplots
fig.suptitle('UMAP 3D from Three Angles and Three Augmented DataFrames')
# first row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 1 + i, projection='3d')  # 3x3 grid
    plot_scatter_3d(ax, X_umap_minsup, f'UMAP_minsup ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])  # set the viewing angle
# second row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 4 + i, projection='3d')
    plot_scatter_3d(ax, X_umap_topK, f'UMAP_topK ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])
# third row of subplots
for i in range(3):
    ax = fig.add_subplot(3, 3, 7 + i, projection='3d')
    plot_scatter_3d(ax, X_umap_max, f'UMAP_max ({angle_3D[0][i]}, {angle_3D[1][i]})')
    ax.view_init(angle_3D[0][i], angle_3D[1][i])
plt.show()
The first row shows FPGrowth with minSup=3: the classes are hard to separate in PCA and t-SNE, while UMAP shows a clearer split, with blue on one side and coral on the other.
The second row shows FAE top-800: its t-SNE shows the most visible class separation of all the plots; the rest look much like the first row.
The third row shows MaxFPGrowth with minSup=3: PCA again shows little separation, much like the other two methods; its t-SNE is even denser, with no visible structure; in UMAP, the third viewing angle clearly shows the two colors each clustering to one side.
3.3.5 Discretization and Binarization¶
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
mlb = preprocessing.LabelBinarizer()
mlb.fit(X.sentiment)
LabelBinarizer()
X['bin_sentiment'] = mlb.transform(X['sentiment']).tolist()
X[0:9]
|  | sentiment | comment | sentiment_name | unigrams | tokenized_comment | bin_sentiment |
|---|---|---|---|---|---|---|
| 0 | 1 | He was a singer with a golden voice that I lov... | not nostalgia | [He, was, a, singer, with, a, golden, voice, t... | {love, emotional, at, great, vouch, You, age, ... | [1] |
| 1 | 0 | The mist beautiful voice ever I listened to hi... | nostalgia | [The, mist, beautiful, voice, ever, I, listene... | {love, Never, I, when, voice, The, an, and, hi... | [0] |
| 2 | 0 | I have most of Mr. Reeves songs. Always love ... | nostalgia | [I, have, most, of, Mr., Reeves, songs, ., Alw... | {so, love, comforting, people, sounds, up, wer... | [0] |
| 3 | 1 | 30 day leave from 1st tour in Viet Nam to conv... | not nostalgia | [30, day, leave, from, 1st, tour, in, Viet, Na... | {back, be, me", God, 30, receive., "marry, 1st... | [1] |
| 4 | 0 | listening to his songs reminds me of my mum wh... | nostalgia | [listening, to, his, songs, reminds, me, of, m... | {reminds, of, songs, played, listening, mum, m... | [0] |
| 5 | 0 | Every time I heard this song as a child, I use... | nostalgia | [Every, time, I, heard, this, song, as, a, chi... | {death,, reminded, got, child,, time, song., E... | [0] |
| 6 | 0 | My dad loved listening to Jim Reeves, when I w... | nostalgia | [My, dad, loved, listening, to, Jim, Reeves, ,... | {back, changes, listening, Time, loved, do, I,... | [0] |
| 7 | 0 | i HAVE ALSO LISTENED TO Jim Reeves since child... | nostalgia | [i, HAVE, ALSO, LISTENED, TO, Jim, Reeves, sin... | {love, 71, he, nostalgic, LISTENED, I, ALSO, J... | [0] |
| 8 | 1 | Wherever you are you always in my heart | not nostalgia | [Wherever, you, are, you, always, in, my, heart] | {Wherever, in, you, my, are, always, heart} | [1] |
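With only two classes, `LabelBinarizer` produces a single 0/1 column rather than a one-hot pair, which is why each `bin_sentiment` entry above is a one-element list:

```python
from sklearn.preprocessing import LabelBinarizer

mlb = LabelBinarizer()
# binary input -> one column, not two one-hot columns
binarized = mlb.fit_transform([0, 1, 1, 0])
```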
3.4 TF-IDF¶
3.4.1 Feature subset selection¶
from sklearn.feature_extraction.text import TfidfVectorizer
TFIDF_vect = TfidfVectorizer()
X_TFIDF = TFIDF_vect.fit_transform(X.comment)
TFIDF_terms = TFIDF_vect.get_feature_names_out()
TFIDF_df = pd.DataFrame(X_TFIDF.toarray(), columns=TFIDF_terms, index=X.index)
TFIDF_df.shape
(1499, 3730)
# Inspect a slice of the current feature matrix
plot_x = ["term_" + str(t) for t in TFIDF_terms[0:20]]
plot_y = ["doc_" + str(i) for i in list(X.index)[0:20]]
plot_z = X_TFIDF[0:20, 0:20].toarray() # X_TFIDF[documents, terms]
# Visualize with a heatmap
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(10, 5))
ax = sns.heatmap(df_todraw,
                 cmap="PuRd",         # pink-toned colormap
                 vmin=0, annot=True)  # annot prints the value inside each cell
plt.show()
The heatmap shows that the first 20 features are almost entirely zero across the first 20 documents, so this slice of the feature matrix carries very little information.
3.4.2 Attribute Aggregation¶
Finding the raw features of each category¶
from sklearn.feature_selection import VarianceThreshold
# Build a term-document matrix with TfidfVectorizer, then drop low-variance features
def create_term_document_df_TfidfVector(df, threshold=0.0, min_df=0.0, max_df=1.0):
    TFIDF_vect_temp = TfidfVectorizer(min_df=min_df, max_df=max_df)
    X_TFIDF_temp = TFIDF_vect_temp.fit_transform(df['comment'])
    selector = VarianceThreshold(threshold=threshold)
    X_selected = selector.fit_transform(X_TFIDF_temp.toarray())
    selected_features = TFIDF_vect_temp.get_feature_names_out()[selector.get_support()]
    term_document_df_temp = pd.DataFrame(X_selected, columns=selected_features)
    return term_document_df_temp
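The `VarianceThreshold` step inside this function can be sanity-checked on a tiny synthetic matrix (not the notebook's data): columns whose variance does not exceed the threshold are removed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

M = np.array([
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.3],
    [0.0, 0.5, 0.0],
])
selector = VarianceThreshold(threshold=0.0)  # drops constant (zero-variance) columns
M_sel = selector.fit_transform(M)
print(M_sel.shape)             # the all-zero first column is gone
print(selector.get_support())  # boolean mask of the kept columns
```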
# Build a category-specific feature set for each of the two classes
filt_term_document_dfs_TFIDF = {}
for category in categories:
    filt_term_document_dfs_TFIDF[category] = create_term_document_df_TfidfVector(category_dfs[category])
# Display the filtered matrices
for category in categories:
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(filt_term_document_dfs_TFIDF[category])
Filtered Term-Document Frequency DataFrame for Category nostalgia:
07 10 11 11th 12 13 14 15 16 17 ... young \
0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
1 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.135932 ... 0.0
2 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
3 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
4 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
.. ... ... ... ... ... ... ... ... ... ... ... ...
745 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
746 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
747 0.0 0.0 0.0 0.0 0.0 0.225266 0.0 0.0 0.0 0.000000 ... 0.0
748 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
749 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.0
younger youngster your yours youth youthful youtube yrs \
0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.196577 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... ... ... ...
745 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
746 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
747 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
748 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
749 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0
yup
0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 0.000000
.. ...
745 0.355567
746 0.000000
747 0.000000
748 0.000000
749 0.000000
[750 rows x 2295 columns]
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
00 000 045 10 100 10m 11 12 14 15 ... youngest \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
.. ... ... ... ... ... ... ... ... ... ... ... ...
744 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
745 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
746 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
747 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
748 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
youngsters your yourself youth youtube yrs yuo zealand zulus
0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... ... ... ... ...
744 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
745 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
746 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
747 0.0 0.244427 0.0 0.0 0.0 0.0 0.0 0.0 0.0
748 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[749 rows x 2602 columns]
for category in categories:
    word_counts = filt_term_document_dfs_TFIDF[category].sum(axis=0).to_numpy()
    plt.close()
    plt.figure(figsize=(10, 6))
    plt.hist(word_counts, bins=100, color='blue', edgecolor='black')
    plt.title(f'Term Frequency Distribution for Category {category}')
    plt.xlabel('Frequency')
    plt.ylabel('Number of Terms')
    plt.show()
Selecting the more useful features of each category¶
select_dfs_TFIDF = {}
for category in categories:
    select_dfs_TFIDF[category] = create_term_document_df_TfidfVector(category_dfs[category], 0.0005)
    print(f"Filtered Term-Document Frequency DataFrame for Category {category}:")
    print(select_dfs_TFIDF[category])
Filtered Term-Document Frequency DataFrame for Category nostalgia:
10 12 13 14 16 17 18 1963 1966 1973 ... year \
0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
1 0.0 0.0 0.000000 0.0 0.0 0.135932 0.0 0.0 0.0 0.0 ... 0.0
2 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
3 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
4 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
.. ... ... ... ... ... ... ... ... ... ... ... ...
745 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
746 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
747 0.0 0.0 0.225266 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
748 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
749 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0
years yes yesterday you young younger your youth yrs
0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
3 0.000000 0.0 0.0 0.0 0.0 0.0 0.196577 0.0 0.0
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
.. ... ... ... ... ... ... ... ... ...
745 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
746 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
747 0.109805 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
748 0.076134 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
749 0.174062 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0
[750 rows x 437 columns]
Filtered Term-Document Frequency DataFrame for Category not nostalgia:
16 2019 50 60 60s about absolutely actress actually after \
0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
.. ... ... ... ... ... ... ... ... ... ...
744 0.0 0.0 0.0 0.0 0.0 0.255993 0.0 0.0 0.0 0.0
745 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
746 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
747 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
748 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
... wow wrong year years yes yet you young your \
0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.157347 0.0 0.000000
1 ... 0.0 0.0 0.0 0.094608 0.0 0.0 0.057036 0.0 0.000000
2 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.458635 0.0 0.000000
3 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000
4 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000
.. ... ... ... ... ... ... ... ... ... ...
744 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.176040 0.0 0.000000
745 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000
746 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000
747 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.297079 0.0 0.244427
748 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.144532 0.0 0.000000
youtube
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. ...
744 0.0
745 0.0
746 0.0
747 0.0
748 0.0
[749 rows x 435 columns]
# Merge the two categories' feature lists; the union keeps each term only once,
# even if it appears in both categories
# Dictionary holding the filtered feature (column) names of each category
filtered_columns = {}
# Collect and print the filtered column names per category
for category in categories:
    filtered_columns[category] = select_dfs_TFIDF[category].columns.tolist()
    print(f"Filtered columns for Category {category} and len is {len(filtered_columns[category])}:")
    print(filtered_columns[category])
unique_filtered_columns = set(filtered_columns[categories[0]]).union(set(filtered_columns[categories[1]]))
merged_filtered_columns = list(unique_filtered_columns)
# Display the merged feature names
print(f"Merged unique filtered columns and len is {len(merged_filtered_columns)}:")
print(merged_filtered_columns)
Filtered columns for Category nostalgia and len is 437): ['10', '12', '13', '14', '16', '17', '18', '1963', '1966', '1973', '1975', '20', '2018', '2019', '30', '40', '50', '50s', '55', '56', '60', '60s', '70', '70s', '80', '80s', '90', 'about', 'absolutely', 'actually', 'adore', 'after', 'afternoon', 'again', 'age', 'ago', 'album', 'alive', 'all', 'almost', 'always', 'am', 'amazing', 'an', 'and', 'another', 'anymore', 'are', 'around', 'artists', 'as', 'at', 'ate', 'away', 'awesome', 'back', 'be', 'beautiful', 'because', 'been', 'before', 'being', 'best', 'better', 'big', 'billy', 'bless', 'born', 'both', 'boy', 'boyfriend', 'brenda', 'brilliant', 'bring', 'bringing', 'brings', 'brother', 'brought', 'but', 'by', 'came', 'can', 'car', 'carl', 'cassette', 'changed', 'child', 'childhood', 'classic', 'clearly', 'club', 'come', 'coming', 'could', 'country', 'cry', 'crying', 'dad', 'daddy', 'damn', 'dance', 'danced', 'dancing', 'date', 'day', 'days', 'deceased', 'definitely', 'did', 'didn', 'died', 'do', 'don', 'during', 'each', 'early', 'elvis', 'end', 'engelbert', 'era', 'especially', 'even', 'ever', 'every', 'everyday', 'everyone', 'everything', 'everytime', 'evokes', 'ex', 'eyes', 'family', 'fantastic', 'fast', 'father', 'favorite', 'favorites', 'feel', 'feeling', 'felt', 'few', 'finally', 'find', 'first', 'flies', 'for', 'forever', 'forget', 'forgot', 'friend', 'friends', 'from', 'full', 'germany', 'get', 'getting', 'girl', 'girlfriend', 'girls', 'glad', 'go', 'god', 'gone', 'good', 'got', 'grade', 'grandma', 'grandmother', 'grandparents', 'grannys', 'great', 'grew', 'group', 'grow', 'growing', 'had', 'hahaha', 'happiest', 'happy', 'hard', 'has', 'have', 'he', 'hear', 'heard', 'hearing', 'heart', 'heaven', 'her', 'here', 'high', 'him', 'his', 'hit', 'holiday', 'house', 'how', 'humperdinck', 'if', 'in', 'into', 'is', 'it', 'its', 'jim', 'july', 'june', 'just', 'karen', 'kid', 'kind', 'know', 'lady', 'lane', 'last', 'late', 'later', 'laura', 'learned', 'left', 'life', 
'like', 'liked', 'listen', 'listened', 'listening', 'little', 'live', 'lol', 'lonely', 'long', 'looking', 'lot', 'lots', 'love', 'loved', 'lovely', 'lyrics', 'machine', 'made', 'make', 'makes', 'mama', 'man', 'many', 'marvellous', 'marvelous', 'mary', 'me', 'memories', 'memory', 'met', 'mid', 'mind', 'mine', 'miracles', 'miss', 'missed', 'mom', 'moments', 'more', 'morning', 'most', 'mother', 'much', 'mum', 'mums', 'music', 'my', 'name', 'need', 'never', 'new', 'nice', 'night', 'no', 'nostalgia', 'nostalgic', 'not', 'nothing', 'now', 'of', 'oh', 'old', 'older', 'oldies', 'omg', 'on', 'once', 'one', 'only', 'or', 'our', 'out', 'over', 'parents', 'part', 'party', 'passed', 'past', 'people', 'pictures', 'pilot', 'play', 'played', 'player', 'playing', 'please', 'posting', 'radio', 'real', 'really', 'record', 'reeves', 'remember', 'remembered', 'remembering', 'remind', 'reminded', 'reminds', 'reminiscing', 'right', 'rip', 'rock', 'sad', 'same', 'sang', 'saturday', 'say', 'school', 'sears', 'see', 'senior', 'sentimentality', 'she', 'simpler', 'since', 'sing', 'singer', 'singing', 'sister', 'skating', 'sleep', 'so', 'some', 'someone', 'song', 'songs', 'soul', 'sounds', 'special', 'still', 'such', 'summer', 'sunday', 'sure', 'sweet', 'take', 'takes', 'tape', 'tears', 'teen', 'teenager', 'than', 'thank', 'thanks', 'that', 'thats', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'things', 'think', 'this', 'those', 'though', 'thought', 'time', 'timeless', 'times', 'to', 'today', 'too', 'top', 'track', 'tune', 'tv', 'understand', 'until', 'up', 'us', 'usa', 'use', 'used', 'very', 'video', 'voice', 'want', 'was', 'wasn', 'way', 'we', 'wedding', 'well', 'were', 'what', 'when', 'whenever', 'where', 'which', 'while', 'who', 'why', 'will', 'wish', 'with', 'woman', 'wonderful', 'words', 'world', 'would', 'wow', 'ya', 'year', 'years', 'yes', 'yesterday', 'you', 'young', 'younger', 'your', 'youth', 'yrs'] Filtered columns for Category not nostalgia and len is 435): ['16', 
'2019', '50', '60', '60s', 'about', 'absolutely', 'actress', 'actually', 'after', 'again', 'age', 'ago', 'agree', 'all', 'almost', 'also', 'always', 'am', 'amazing', 'an', 'and', 'another', 'any', 'anybody', 'anymore', 'anyone', 'anything', 'appreciate', 'appreciated', 'are', 'around', 'artist', 'as', 'at', 'awesome', 'baby', 'back', 'background', 'bad', 'band', 'be', 'beat', 'beautiful', 'beauty', 'because', 'been', 'before', 'believe', 'best', 'better', 'billy', 'bit', 'bless', 'born', 'bought', 'boy', 'break', 'brenda', 'brilliant', 'brothers', 'but', 'by', 'called', 'came', 'can', 'certainly', 'change', 'childhood', 'class', 'classic', 'classics', 'close', 'come', 'comma', 'comment', 'compose', 'concert', 'could', 'country', 'course', 'crap', 'cry', 'crying', 'dance', 'dancing', 'daughter', 'day', 'days', 'dear', 'did', 'didn', 'different', 'disco', 'do', 'does', 'don', 'done', 'down', 'dynamite', 'early', 'earth', 'else', 'elvis', 'emotion', 'end', 'englebert', 'english', 'enjoy', 'era', 'especially', 'even', 'ever', 'every', 'everyday', 'everyone', 'everything', 'eyes', 'falling', 'family', 'fantastic', 'favorite', 'favorites', 'feel', 'feeling', 'female', 'few', 'find', 'first', 'for', 'forever', 'forget', 'found', 'friend', 'from', 'full', 'future', 'generation', 'generations', 'get', 'girl', 'give', 'glad', 'go', 'god', 'goes', 'gold', 'golden', 'gone', 'gonna', 'good', 'gorgeous', 'got', 'great', 'greatest', 'grew', 'guy', 'guys', 'had', 'handsome', 'hank', 'happened', 'happy', 'has', 'have', 'he', 'head', 'hear', 'heard', 'hearing', 'heart', 'heaven', 'her', 'here', 'him', 'his', 'history', 'hit', 'hits', 'home', 'hope', 'how', 'if', 'images', 'in', 'into', 'intro', 'irreplaceable', 'is', 'it', 'its', 'just', 'keep', 'kind', 'king', 'know', 'lady', 'last', 'late', 'laura', 'learn', 'least', 'leave', 'left', 'legend', 'let', 'life', 'like', 'listen', 'listened', 'listening', 'little', 'live', 'lived', 'll', 'lol', 'lonely', 'long', 'look', 'looking', 
'looks', 'loss', 'lost', 'lot', 'love', 'loved', 'lovely', 'loving', 'lyrics', 'made', 'magnificent', 'make', 'makes', 'man', 'mans', 'many', 'masterpiece', 'matter', 'may', 'me', 'mean', 'meaning', 'melody', 'men', 'mind', 'miss', 'mom', 'moment', 'more', 'most', 'movie', 'much', 'music', 'my', 'na', 'name', 'never', 'new', 'nice', 'no', 'not', 'nothing', 'now', 'nowadays', 'of', 'off', 'oh', 'old', 'on', 'once', 'one', 'ones', 'only', 'or', 'original', 'others', 'our', 'out', 'over', 'paint', 'parents', 'part', 'peace', 'people', 'perfect', 'performance', 'person', 'pictures', 'play', 'played', 'please', 'pleasure', 'pop', 'posting', 'prefer', 'pretty', 'pure', 'put', 'radio', 're', 'read', 'real', 'really', 'record', 'recordings', 'remains', 'rest', 'right', 'rock', 'roll', 'romantic', 'roy', 'sad', 'same', 'sang', 'say', 'says', 'screen', 'see', 'seen', 'sharing', 'she', 'should', 'sing', 'singer', 'singers', 'singing', 'single', 'sings', 'so', 'some', 'someone', 'something', 'song', 'songs', 'sorrow', 'soul', 'sound', 'sounds', 'special', 'stars', 'started', 'still', 'such', 'sung', 'supernatural', 'sure', 'take', 'talented', 'taste', 'tears', 'tell', 'teresa', 'terrific', 'than', 'thank', 'thanks', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'this', 'those', 'though', 'thought', 'time', 'timeless', 'times', 'titans', 'to', 'today', 'told', 'too', 'touching', 'true', 'truly', 'tune', 'understand', 'unique', 'until', 'untouchable', 'up', 'us', 'used', 've', 'version', 'very', 'video', 'vocals', 'voice', 'voices', 'wake', 'want', 'was', 'way', 'we', 'well', 'were', 'what', 'when', 'where', 'wherever', 'which', 'who', 'why', 'wife', 'will', 'wish', 'wished', 'with', 'without', 'woman', 'wonder', 'wonderful', 'words', 'work', 'world', 'would', 'wow', 'wrong', 'year', 'years', 'yes', 'yet', 'you', 'young', 'your', 'youtube'] Merged unique filtered columns and len is 602: ['left', 'were', '18', 'germany', 
'listening', 'afternoon', 'unique', 'age', 'school', 'humperdinck', 'just', 'earth', 'says', '70s', 'recordings', 'nothing', 'adore', 'greatest', 'lovely', 'life', 'mind', 'its', 'miracles', 'away', 'many', 'bring', '2018', 'singer', 'once', 'no', '55', 'pure', 'beauty', 'player', 'hard', 'about', 'bringing', 'senior', 'roll', 've', 'certainly', 'loved', 'truly', 'performance', 'lane', 'playing', 'and', 'each', 'pop', 'old', 'dear', '70', 'paint', 'men', 'new', 'us', 'never', 'wrong', 'beautiful', 'ever', 'our', 'screen', 'get', 'crap', 'roy', 'single', 'saturday', 'early', 'here', 'legend', 'but', 'teenager', 'sweet', 'takes', 'grannys', 'sung', 'sears', 'wonderful', 'do', 'woman', 'billy', 'ex', 'voice', 'morning', 'skating', 'thing', 'top', 'goes', 'moments', 'feel', 'sure', '16', 'dynamite', 'images', 'wow', 'machine', 'hear', 'heard', 'know', 'had', 'child', 'pilot', 'older', 'to', 'year', 'anymore', 'high', 'mums', 'na', 'grandmother', 'wished', 'listen', 'mine', 'radio', 'history', 'makes', 'another', 'if', 'got', 'at', 'friend', 'great', 'such', 'nostalgic', 'party', 'country', 'person', 'almost', 'enjoy', 'supernatural', 'something', 'always', 'actress', 'songs', 'happened', 'have', 'they', 'singers', 'without', 'intro', 'best', 'dancing', 'leave', 'lived', 'loss', 'mean', 'beat', 'kid', 'we', 'grew', 'englebert', 'melody', 'too', 'still', 'oldies', '10', 'sad', 'baby', 'where', 'hope', 'since', 'close', 'real', 'generation', 'voices', 'handsome', 'understand', 'find', '50', 'video', 'should', 'remind', 'lady', 'reminiscing', 'matter', 'eyes', 'days', 'anybody', 'why', 'memory', 'usa', 'comment', 'it', 'hahaha', 'their', 'grandparents', 'guys', 'others', 'ones', 'she', 'is', 'album', 'even', 'bless', 'brothers', 'generations', 'reminds', 'thank', 'into', 'today', 'now', 'happiest', 'around', 'else', 'terrific', 'also', 'missed', 'let', 'karen', 'keep', 'not', 'mary', 'sister', 'class', 'mother', 'every', 'when', 'part', 'younger', '1966', 'yrs', 'lonely', 
'disco', 'very', 'pictures', '13', 'much', 'the', 'forgot', 'whenever', 'fast', 'world', 'version', 'deceased', 'emotion', 'didn', 'give', 'in', 'course', 'say', 'music', 'movie', 'flies', 'found', 'day', 'your', 're', 'because', 'summer', 'talented', 'that', 'ate', 'most', 'looks', 'end', 'only', 'yesterday', 'group', 'first', 'hank', 'gonna', 'good', '1963', 'name', 'awesome', 'right', 'parents', 'date', '14', '2019', 'forever', 'everyone', 'simpler', 'time', 'mama', 'before', 'an', 'same', 'cassette', 'how', 'feeling', 'believe', 'looking', 'gorgeous', 'this', 'any', '56', 'which', 'omg', 'hits', 'man', 'mum', 'wasn', '40', 'teen', 'bought', 'favorite', 'original', 'everytime', 'girl', 'brenda', 'peace', 'track', 'make', 'read', 'remains', 'yes', 'glad', 'anything', 'wish', 'them', 'her', 'youtube', 'ya', 'grade', 'born', 'clearly', 'tape', 'lol', 'learn', 'club', 'daughter', 'see', 'liked', 'happy', 'or', 'crying', 'need', 'appreciate', 'pretty', 'sharing', 'least', 'times', 'back', 'lost', 'lyrics', 'after', 'sounds', 'full', 'up', 'he', 'like', 'touching', 'heaven', 'head', 'met', 'wherever', 'felt', 'favorites', 'appreciated', 'brother', 'by', 'thanks', 'family', 'marvellous', 'passed', 'rip', 'few', 'holiday', 'years', 'magnificent', 'later', 'take', 'rock', 'go', 'one', 'can', 'father', 'titans', 'ago', 'singing', 'been', 'you', 'teresa', 'thought', 'out', 'than', 'put', 'nice', 'untouchable', 'did', 'anyone', 'heart', 'want', 'being', 'bit', 'pleasure', 'brings', 'tv', 'compose', 'especially', 'all', 'better', 'house', 'those', 'future', 'july', 'stars', 'sing', 'me', 'laura', 'will', 'masterpiece', 'who', 'absolutely', 'his', 'prefer', 'remembering', 'be', 'love', 'people', 'concert', 'think', 'wedding', 'during', 'then', 'don', 'everything', 'miss', 'grandma', 'tune', 'work', '80', 'll', 'boy', 'falling', 'irreplaceable', 'learned', 'perfect', 'brought', 'words', 'female', 'though', 'as', '80s', 'brilliant', 'really', 'record', 'youth', 'late', 'was', 
'well', 'posting', 'mom', 'june', 'mid', 'wife', 'sunday', 'reeves', '60s', 'has', 'bad', 'coming', 'growing', '17', 'home', 'someone', 'classics', 'god', 'died', 'artists', 'big', 'definitely', 'last', 'made', 'what', 'things', 'soul', 'him', 'loving', 'look', 'young', 'please', 'little', 'use', '60', 'listened', 'over', 'on', 'am', 'used', 'childhood', 'cry', 'of', 'moment', 'carl', 'night', 'play', 'lot', 'background', 'off', 'reminded', 'girlfriend', 'vocals', 'tell', 'more', '1975', 'long', 'done', 'change', 'come', 'called', 'wonder', 'may', 'romantic', 'remember', 'nowadays', 'could', 'hearing', 'my', 'english', 'alive', 'oh', 'special', 'changed', 'actually', 'thats', '20', 'kind', 'sorrow', 'getting', 'daddy', 'friends', 'tears', 'yet', 'there', 'again', 'elvis', 'taste', 'memories', '12', 'girls', 'live', 'sentimentality', 'are', 'sang', 'danced', 'mans', 'guy', 'car', 'sleep', 'down', 'told', 'would', 'past', 'lots', 'these', 'until', 'golden', 'agree', 'seen', 'song', 'jim', 'classic', 'break', 'remembered', 'grow', 'does', 'for', 'band', 'gone', 'sound', 'with', '90', 'era', 'hit', 'so', 'artist', 'comma', 'true', '30', 'king', 'while', 'engelbert', 'everyday', 'some', 'dance', 'played', 'nostalgia', 'timeless', 'amazing', 'finally', 'boyfriend', 'came', 'meaning', 'way', 'both', '50s', 'from', 'marvelous', 'sings', 'forget', 'different', 'fantastic', 'started', 'damn', 'wake', 'gold', '1973', 'dad', 'rest', 'evokes']
vectorizer_combined = TfidfVectorizer(vocabulary=merged_filtered_columns)
tfidf_combined_matrix = vectorizer_combined.fit_transform(X.comment)
tfidf_combined_array = tfidf_combined_matrix.toarray()
combined_df_TFIDF = pd.DataFrame(tfidf_combined_array, columns=vectorizer_combined.get_feature_names_out())
print(combined_df_TFIDF)
left were 18 germany listening afternoon unique age \
0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.176167
1 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000
2 0.0 0.125393 0.0 0.0 0.000000 0.0 0.0 0.150142
3 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000
4 0.0 0.000000 0.0 0.0 0.280711 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ...
1494 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000
1495 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000
1496 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.309594
1497 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.350538
1498 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000
school humperdinck ... different fantastic started damn wake \
0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
1 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
2 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
3 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
4 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
... ... ... ... ... ... ... ... ...
1494 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
1495 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
1496 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
1497 0.0 0.0 ... 0.0 0.0 0.212079 0.0 0.0
1498 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0
gold 1973 dad rest evokes
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
1494 0.0 0.0 0.0 0.0 0.0
1495 0.0 0.0 0.0 0.0 0.0
1496 0.0 0.0 0.0 0.0 0.0
1497 0.0 0.0 0.0 0.0 0.0
1498 0.0 0.0 0.0 0.0 0.0
[1499 rows x 602 columns]
Building augmented_df¶
augmented_df_combined = pd.concat([TFIDF_df, combined_df_TFIDF], axis=1)
augmented_df_combined
| 00 | 000 | 045 | 07 | 10 | 100 | 10m | 11 | 11th | 12 | ... | different | fantastic | started | damn | wake | gold | 1973 | dad | rest | evokes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1494 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1495 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1497 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.212079 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1498 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1499 rows × 4332 columns
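Note that 4332 = 3730 + 602, i.e., every merged term now appears as two columns. `pd.concat(axis=1)` keeps duplicate column labels, which can surprise later lookups, as this small toy-frame demonstration shows:

```python
import pandas as pd

a = pd.DataFrame({"love": [0.1, 0.0]})
b = pd.DataFrame({"love": [0.2, 0.3]})
both = pd.concat([a, b], axis=1)
print(both.columns.tolist())  # the label appears twice
print(both["love"].shape)     # selecting it returns a two-column frame
```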
3.4.3 Dimensionality Reduction¶
2D by PCA, t-SNE, UMAP¶
X_pca, X_tsne, X_umap = Dimensionality_2D(augmented_df_combined)
draw_2D_plt(X_pca, X_tsne, X_umap)
2D by Isomap, MDS¶
from sklearn.manifold import Isomap, MDS
def Dimensionality_2D_new(now_df):
    X_isomap = Isomap(n_components=2).fit_transform(now_df.values)
    X_mds = MDS(n_components=2).fit_transform(now_df.values)
    return X_isomap, X_mds

def draw_2D_plt_new(X_isomap, X_mds, categories):
    plt.close()
    fig, axes = plt.subplots(1, 2, figsize=(20, 10))
    fig.suptitle('Isomap and MDS Comparison')
    # Plot each embedding
    plot_scatter(axes[0], X_isomap, 'Isomap')
    plot_scatter(axes[1], X_mds, 'MDS')
    plt.show()
X_isomap, X_mds = Dimensionality_2D_new(TFIDF_df)
draw_2D_plt_new(X_isomap, X_mds, categories)
X_isomap, X_mds = Dimensionality_2D_new(augmented_df_combined)
draw_2D_plt_new(X_isomap, X_mds, categories)
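A compact, self-contained check (random data, not the notebook's frames) that both manifold learners map an `(n_samples, n_features)` matrix down to two dimensions; for Isomap, `n_neighbors` must stay below `n_samples`:

```python
import numpy as np
from sklearn.manifold import Isomap, MDS

rng = np.random.RandomState(0)
Z = rng.rand(30, 10)                                       # 30 samples, 10 features
Z_iso = Isomap(n_components=2, n_neighbors=5).fit_transform(Z)
Z_mds = MDS(n_components=2, random_state=0).fit_transform(Z)
print(Z_iso.shape, Z_mds.shape)                            # both (30, 2)
```

MDS recomputes a full pairwise-distance embedding, so on the 1499-row TF-IDF frames above it is noticeably slower than Isomap.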
4. Data Exploration¶
# Retrieve the comments of three fixed records (rows 50, 100, 150)
document_to_transform_1 = [X.iloc[50]['comment']]
document_to_transform_2 = [X.iloc[100]['comment']]
document_to_transform_3 = [X.iloc[150]['comment']]
print(document_to_transform_1)
print(document_to_transform_2)
print(document_to_transform_3)
['If I remember correctly, this song came out after Mr. Reeves passed away. I was about 10 years old when the disc jockey said that the news just came over the wire that he died in a plane crash.'] ['i guess most of us leave it too late before we tell someone just how much we really love them'] ['my name is thomas but know by tommy and my wifes name is laura and i always sing this to her']
from sklearn.preprocessing import binarize
# Transform sentence with Vectorizers
document_vector_count_1 = count_vect.transform(document_to_transform_1)
document_vector_count_2 = count_vect.transform(document_to_transform_2)
document_vector_count_3 = count_vect.transform(document_to_transform_3)
# Binarize vectors to simplify: 0 for absence, 1 for presence
document_vector_count_1_bin = binarize(document_vector_count_1)
document_vector_count_2_bin = binarize(document_vector_count_2)
document_vector_count_3_bin = binarize(document_vector_count_3)
# print vectors
print("Let's take a look at the count vectors:")
print(document_vector_count_1.todense())
print(document_vector_count_2.todense())
print(document_vector_count_3.todense())
Let's take a look at the count vectors: [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]]
from sklearn.metrics.pairwise import cosine_similarity
# Calculate Cosine Similarity
cos_sim_count_1_2 = cosine_similarity(document_vector_count_1, document_vector_count_2, dense_output=True)
cos_sim_count_1_3 = cosine_similarity(document_vector_count_1, document_vector_count_3, dense_output=True)
cos_sim_count_2_3 = cosine_similarity(document_vector_count_2, document_vector_count_3, dense_output=True)
cos_sim_count_1_1 = cosine_similarity(document_vector_count_1, document_vector_count_1, dense_output=True)
cos_sim_count_2_2 = cosine_similarity(document_vector_count_2, document_vector_count_2, dense_output=True)
cos_sim_count_3_3 = cosine_similarity(document_vector_count_3, document_vector_count_3, dense_output=True)
# Print the cosine similarity values
print("Cosine Similarity using count between 1 and 2: %.4f" % cos_sim_count_1_2[0][0])
print("Cosine Similarity using count between 1 and 3: %.4f" % cos_sim_count_1_3[0][0])
print("Cosine Similarity using count between 2 and 3: %.4f" % cos_sim_count_2_3[0][0])
print("Cosine Similarity using count between 1 and 1: %.4f" % cos_sim_count_1_1[0][0])
print("Cosine Similarity using count between 2 and 2: %.4f" % cos_sim_count_2_2[0][0])
print("Cosine Similarity using count between 3 and 3: %.4f" % cos_sim_count_3_3[0][0])
Cosine Similarity using count between 1 and 2: 0.0322 Cosine Similarity using count between 1 and 3: 0.0279 Cosine Similarity using count between 2 and 3: 0.0000 Cosine Similarity using count between 1 and 1: 1.0000 Cosine Similarity using count between 2 and 2: 1.0000 Cosine Similarity using count between 3 and 3: 1.0000
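The values above can be hand-checked: cosine similarity is dot(a, b) / (|a| · |b|), which is why every vector's similarity with itself is 1. A tiny sketch comparing a manual computation against `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 2, 0]])
b = np.array([[2, 1, 1]])
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))
sk = cosine_similarity(a, b)[0][0]
print(round(manual, 4), round(sk, 4))  # both 0.7303
```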
5. Data Classification¶
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score
category_mapping = dict(X[['sentiment', 'sentiment_name']].drop_duplicates().values)
print(category_mapping)
target_names = [category_mapping[label] for label in sorted(category_mapping.keys())]
print(target_names)
{1: 'not nostalgia', 0: 'nostalgia'}
['nostalgia', 'not nostalgia']
def Bernoulli_model(X_train, X_test, y_train, y_test):
    model = BernoulliNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names, digits=4)
    print("Accuracy:", accuracy)
    print("Classification report:\n", report)

def Multinomial_model(X_train, X_test, y_train, y_test):
    model = MultinomialNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names, digits=4)
    print("Accuracy:", accuracy)
    print("Classification report:\n", report)

def Gaussian_model(X_train, X_test, y_train, y_test):
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names, digits=4)
    print("Accuracy:", accuracy)
    print("Classification report:\n", report)
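The three variants differ only in the likelihood they assume per feature: BernoulliNB binarizes features to presence/absence, MultinomialNB models nonnegative counts (and tolerates TF-IDF-style fractional values), and GaussianNB fits a normal distribution per feature, which is why it behaves worst on sparse text matrices. A minimal sketch (synthetic data, not the notebook's splits) showing their shared fit/predict interface:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

rng = np.random.RandomState(0)
Xs = rng.rand(40, 5)                 # nonnegative, so all three models accept it
ys = (Xs[:, 0] > 0.5).astype(int)    # label depends on the first feature

for Model in (BernoulliNB, MultinomialNB, GaussianNB):
    clf = Model().fit(Xs, ys)
    preds = clf.predict(Xs)
    print(Model.__name__, preds.shape)
```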
5.1 CountVectorizer¶
5.1.1 tdm_df¶
X_train, X_test, y_train, y_test = train_test_split(tdm_df, X['sentiment'], test_size=0.3, random_state=2)
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8866666666666667
Classification report:
precision recall f1-score support
nostalgia 0.8756 0.8955 0.8854 220
not nostalgia 0.8978 0.8783 0.8879 230
accuracy 0.8867 450
macro avg 0.8867 0.8869 0.8867 450
weighted avg 0.8869 0.8867 0.8867 450
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8711111111111111
Classification report:
precision recall f1-score support
nostalgia 0.8293 0.9273 0.8755 220
not nostalgia 0.9216 0.8174 0.8664 230
accuracy 0.8711 450
macro avg 0.8754 0.8723 0.8709 450
weighted avg 0.8764 0.8711 0.8708 450
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.6666666666666666
Classification report:
precision recall f1-score support
nostalgia 0.6144 0.8545 0.7148 220
not nostalgia 0.7778 0.4870 0.5989 230
accuracy 0.6667 450
macro avg 0.6961 0.6708 0.6569 450
weighted avg 0.6979 0.6667 0.6556 450
5.1.2 augmented_df_minsup¶
X_train, X_test, y_train, y_test = train_test_split(augmented_df_minsup, X['sentiment'], test_size=0.3, random_state=2)
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.9044444444444445
Classification report:
precision recall f1-score support
nostalgia 0.9116 0.8909 0.9011 220
not nostalgia 0.8979 0.9174 0.9075 230
accuracy 0.9044 450
macro avg 0.9048 0.9042 0.9043 450
weighted avg 0.9046 0.9044 0.9044 450
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8866666666666667
Classification report:
precision recall f1-score support
nostalgia 0.8477 0.9364 0.8898 220
not nostalgia 0.9324 0.8391 0.8833 230
accuracy 0.8867 450
macro avg 0.8901 0.8877 0.8866 450
weighted avg 0.8910 0.8867 0.8865 450
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.7733333333333333
Classification report:
precision recall f1-score support
nostalgia 0.7500 0.8045 0.7763 220
not nostalgia 0.7991 0.7435 0.7703 230
accuracy 0.7733 450
macro avg 0.7745 0.7740 0.7733 450
weighted avg 0.7751 0.7733 0.7732 450
5.1.3 augmented_df_topK¶
X_train, X_test, y_train, y_test = train_test_split(augmented_df_topK, X['sentiment'], test_size=0.3, random_state=2)
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.9044444444444445
Classification report:
precision recall f1-score support
nostalgia 0.9078 0.8955 0.9016 220
not nostalgia 0.9013 0.9130 0.9071 230
accuracy 0.9044 450
macro avg 0.9046 0.9042 0.9044 450
weighted avg 0.9045 0.9044 0.9044 450
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8777777777777778
Classification report:
precision recall f1-score support
nostalgia 0.8340 0.9364 0.8822 220
not nostalgia 0.9310 0.8217 0.8730 230
accuracy 0.8778 450
macro avg 0.8825 0.8791 0.8776 450
weighted avg 0.8836 0.8778 0.8775 450
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.7111111111111111
Classification report:
precision recall f1-score support
nostalgia 0.6531 0.8727 0.7471 220
not nostalgia 0.8205 0.5565 0.6632 230
accuracy 0.7111 450
macro avg 0.7368 0.7146 0.7051 450
weighted avg 0.7386 0.7111 0.7042 450
5.1.4 augmented_df_max¶
X_train, X_test, y_train, y_test = train_test_split(augmented_df_max, X['sentiment'], test_size=0.3, random_state=2)
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8911111111111111
Classification report:
precision recall f1-score support
nostalgia 0.8977 0.8773 0.8874 220
not nostalgia 0.8851 0.9043 0.8946 230
accuracy 0.8911 450
macro avg 0.8914 0.8908 0.8910 450
weighted avg 0.8913 0.8911 0.8911 450
Multinomial_model(X_train, X_test, y_train, y_test)
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.7511111111111111
Classification report:
precision recall f1-score support
nostalgia 0.7030 0.8500 0.7695 220
not nostalgia 0.8207 0.6565 0.7295 230
accuracy 0.7511 450
macro avg 0.7618 0.7533 0.7495 450
weighted avg 0.7631 0.7511 0.7491 450
5.2 TFIDF¶
5.2.1 TFIDF_df (original data)¶
X_train, X_test, y_train, y_test = train_test_split(TFIDF_df, X['sentiment'], test_size=0.3, random_state=2)
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8866666666666667
Classification report:
precision recall f1-score support
nostalgia 0.8756 0.8955 0.8854 220
not nostalgia 0.8978 0.8783 0.8879 230
accuracy 0.8867 450
macro avg 0.8867 0.8869 0.8867 450
weighted avg 0.8869 0.8867 0.8867 450
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8555555555555555
Classification report:
precision recall f1-score support
nostalgia 0.7992 0.9409 0.8643 220
not nostalgia 0.9319 0.7739 0.8456 230
accuracy 0.8556 450
macro avg 0.8656 0.8574 0.8550 450
weighted avg 0.8671 0.8556 0.8547 450
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.66
Classification report:
precision recall f1-score support
nostalgia 0.6184 0.7955 0.6958 220
not nostalgia 0.7305 0.5304 0.6146 230
accuracy 0.6600 450
macro avg 0.6745 0.6629 0.6552 450
weighted avg 0.6757 0.6600 0.6543 450
5.2.2 augmented_df_combined (using variance)¶
X_train, X_test, y_train, y_test = train_test_split(augmented_df_combined, X['sentiment'], test_size=0.3, random_state=2)
X_test
| 00 | 000 | 045 | 07 | 10 | 100 | 10m | 11 | 11th | 12 | ... | different | fantastic | started | damn | wake | gold | 1973 | dad | rest | evokes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1321 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 903 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.192976 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 1275 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 69 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 708 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57711 | 0.0 | 0.0 |
| 60 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 201 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 265 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
| 472 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 |
450 rows × 4332 columns
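The column listing shows that the vocabulary contains numeric tokens such as `00`, `045`, and `10m`, which rarely carry sentiment. One possible cleanup — an assumption, not part of the original pipeline — is to keep alphabetic tokens only, e.g. via `CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')`. The filter amounts to:

```python
import re

# A few of the columns shown above:
cols = ['00', '000', '045', '10m', '1973', 'different', 'fantastic', 'dad']
# Keep purely alphabetic tokens of length >= 2, dropping numeric noise:
alpha_only = [c for c in cols if re.fullmatch(r'[A-Za-z]{2,}', c)]
print(alpha_only)  # ['different', 'fantastic', 'dad']
```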
Bernoulli_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8955555555555555
Classification report:
precision recall f1-score support
nostalgia 0.8914 0.8955 0.8934 220
not nostalgia 0.8996 0.8957 0.8976 230
accuracy 0.8956 450
macro avg 0.8955 0.8956 0.8955 450
weighted avg 0.8956 0.8956 0.8956 450
Multinomial_model(X_train, X_test, y_train, y_test)
Accuracy: 0.8777777777777778
Classification report:
precision recall f1-score support
nostalgia 0.8367 0.9318 0.8817 220
not nostalgia 0.9268 0.8261 0.8736 230
accuracy 0.8778 450
macro avg 0.8818 0.8790 0.8776 450
weighted avg 0.8828 0.8778 0.8776 450
Gaussian_model(X_train, X_test, y_train, y_test)
Accuracy: 0.6688888888888889
Classification report:
precision recall f1-score support
nostalgia 0.6263 0.8000 0.7026 220
not nostalgia 0.7396 0.5435 0.6266 230
accuracy 0.6689 450
macro avg 0.6830 0.6717 0.6646 450
weighted avg 0.6842 0.6689 0.6637 450
Comment: Across all feature sets, BernoulliNB outperforms both MultinomialNB and GaussianNB. This is likely because the feature matrix is very sparse — most terms appear in only a few comments — so a binary presence/absence model fits well, while the other two models are more easily skewed by high-frequency words.
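The sparsity point is easy to check directly: in a term-document matrix, most entries are zero. A toy illustration (the matrix is hypothetical, standing in for `tdm_df`; real term-document matrices are far sparser):

```python
# 3 documents x 4 terms:
rows = [[0, 1, 0, 0],
        [0, 0, 0, 2],
        [1, 0, 0, 0]]
nonzero = sum(1 for row in rows for v in row if v)
sparsity = 1 - nonzero / (len(rows) * len(rows[0]))
print(f"{sparsity:.0%} of entries are zero")  # 3 nonzero of 12 -> 75%
```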
In data processing I also used two other methods, FAE topK and MFPGrowth, and wrapped each as a function so it can be reused quickly with different parameters later. FAE topK lets me select the most informative features quickly, avoiding an oversized feature set that could interfere with model training.